This analysis requires several R packages for different aspects of RNA-seq data analysis:
DESeq2: Differential expression analysistidyverse: Data manipulation and visualizationggplot2: Advanced plottingEnhancedVolcano: Enhanced volcano plotspheatmap: Pretty heatmapsRColorBrewer: Color palettesRtsne: t-SNE dimensionality reductionclusterProfiler: Gene set enrichment analysisDOSE: Disease ontologyenrichplot: Enrichment visualizationorg.Hs.eg.db: Human genome annotationspathview: Pathway visualization# Core analysis packages
suppressPackageStartupMessages({
library(DESeq2)
library(ggplot2)
library(tidyverse)
# Visualization packages
library(EnhancedVolcano)
library(ggupset)
library(pheatmap)
library(RColorBrewer)
library(Rtsne)
# Functional analysis packages
library(clusterProfiler)
library(DOSE)
library(enrichplot)
library(org.Hs.eg.db)
library(pathview)
# Additional utilities
library(biomaRt)
library(cowplot)
library(data.table)
library(ggrepel)
library(here)
library(janitor)
library(kableExtra)
library(knitr)
library(Matrix)
library(pdftools)
library(readr)
library(scales)
library(showtext)
library(tibble)
library(viridis)
})This RMarkdown document contains a comprehensive analysis of RNA-sequencing data comparing hypoxic and normoxic conditions in LNCaP and PC3 prostate cancer cell lines(Lu 2020). The analysis includes:
🎯 Analysis Goals:
This comprehensive RNA-seq analysis pipeline is structured in three main stages(He, Lin, and Chen 2022):
Figure 1: Complete workflow of RNA-seq data analysis pipeline
This comprehensive pipeline demonstrates end-to-end RNA-seq analysis, taking you from raw sequencing data to biological insights. The analysis workflow includes:
Note: If you are working with your own RNA-seq experiment data obtained directly from a sequencing facility, you can skip to the section on [Mapping reads using nf-core/rnaseq].
We analyze RNA-seq data from a study published in Nature Communications:
📚 Reference Publication:
Guo et al. Nature
Communications 2019
Focus: Hypoxia response in prostate cancer cell lines
The study examines two prostate cancer cell lines under different oxygen conditions:
| Sample Name | GSM ID | SRX ID | SRR IDs |
|---|---|---|---|
| LNCaP_Empty_Normoxia_rep1 | GSM3145509 | SRX4096735 | SRR7179504-7 |
| LNCaP_Empty_Normoxia_rep2 | GSM3145510 | SRX4096736 | SRR7179508-11 |
| LNCaP_Empty_Hypoxia_rep1 | GSM3145513 | SRX4096739 | SRR7179520-3 |
| LNCaP_Empty_Hypoxia_rep2 | GSM3145514 | SRX4096740 | SRR7179524-7 |
| PC3_siCtrl_Normoxia_rep1 | GSM3145517 | SRX4096743 | SRR7179536 |
| PC3_siCtrl_Normoxia_rep2 | GSM3145518 | SRX4096744 | SRR7179537 |
| PC3_siCtrl_Hypoxia_rep1 | GSM3145521 | SRX4096747 | SRR7179540 |
| PC3_siCtrl_Hypoxia_rep2 | GSM3145522 | SRX4096748 | SRR7179541 |
To download the sequencing data, we’ll use the NCBI’s SRA toolkit, a powerful suite of tools for accessing sequencing data from the Sequence Read Archive.
Installation Note: - On Linux systems:
sudo apt install sra-toolkit - For other systems: Visit the
SRA Toolkit
documentation
Create a file named list.txt containing the SRA run
accessions:
🖥️ Human Genetics Computing Cluster (HGCC)
We will utilize the HGCC’s computational resources for downloading and processing the SRA files into FASTQ format.
First, we’ll download the SRA files using the SRA toolkit:
The downloaded data requires some processing before analysis. We’ll handle the LNCaP and PC3 samples differently due to their sequencing structure.
LNCaP samples have multiple sequencing runs that need to be concatenated:
# Process LNCaP Normoxia samples
echo "Processing LNCaP Normoxia samples..."
cat SRR7179504_pass.fastq.gz SRR7179505_pass.fastq.gz \
SRR7179506_pass.fastq.gz SRR7179507_pass.fastq.gz \
> LNCAP_Normoxia_S1.fastq.gz
cat SRR7179508_pass.fastq.gz SRR7179509_pass.fastq.gz \
SRR7179510_pass.fastq.gz SRR7179511_pass.fastq.gz \
> LNCAP_Normoxia_S2.fastq.gz
# Process LNCaP Hypoxia samples
echo "Processing LNCaP Hypoxia samples..."
cat SRR7179520_pass.fastq.gz SRR7179521_pass.fastq.gz \
SRR7179522_pass.fastq.gz SRR7179523_pass.fastq.gz \
> LNCAP_Hypoxia_S1.fastq.gz
cat SRR7179524_pass.fastq.gz SRR7179525_pass.fastq.gz \
SRR7179526_pass.fastq.gz SRR7179527_pass.fastq.gz \
> LNCAP_Hypoxia_S2.fastq.gzPC3 samples were sequenced in single runs, so we’ll simply rename them:
Remove intermediate files to free up space:
✓ Completion Check: After processing, you should have 8 FASTQ files in your directory: - 4 LNCaP files (2 Normoxia, 2 Hypoxia) - 4 PC3 files (2 Normoxia, 2 Hypoxia)
These files are now ready for alignment to the reference genome!
The nf-core/rnaseq pipeline provides a robust and standardized approach for RNA sequencing analysis(Ewels et al. 2020). It encompasses all necessary steps from raw FASTQ files to gene expression quantification.
Figure 2: NF-core/RNAseq workflow
The pipeline requires a carefully formatted sample sheet that specifies the input files and their characteristics.
Key Requirements: 1️⃣ The first 4 columns must follow a specific format 2️⃣ Auto-detection of single-end vs paired-end data 3️⃣ Flexible additional columns for metadata
The sample sheet should be a CSV file with the following columns:sample,fastq_1,fastq_2,strandedness.
| sample | fastq_1 | fastq_2 | strandedness |
|---|---|---|---|
| LNCAP_Hypoxia_S1 | rawData/LNCAP_Hypoxia_S1.fastq.gz | auto | |
| LNCAP_Hypoxia_S2 | rawData/LNCAP_Hypoxia_S2.fastq.gz | auto | |
| LNCAP_Normoxia_S1 | rawData/LNCAP_Normoxia_S1.fastq.gz | auto | |
| LNCAP_Normoxia_S2 | rawData/LNCAP_Normoxia_S2.fastq.gz | auto | |
| PC3_Hypoxia_S1 | rawData/PC3_Hypoxia_S1.fastq.gz | auto | |
| PC3_Hypoxia_S2 | rawData/PC3_Hypoxia_S2.fastq.gz | auto | |
| PC3_Normoxia_S1 | rawData/PC3_Normoxia_S1.fastq.gz | auto | |
| PC3_Normoxia_S2 | rawData/PC3_Normoxia_S2.fastq.gz | auto |
The pipeline requires two key reference files: 1. FASTA file (genome sequence) 2. GTF file (genome annotation)
We’ll download the latest human reference genome from Ensembl:
# Get latest Ensembl release number
latest_release=$(curl -s 'http://rest.ensembl.org/info/software?content-type=application/json' | \
grep -o '"release":[0-9]*' | cut -d: -f2)
# Download genome sequence
wget -L ftp://ftp.ensembl.org/pub/release-${latest_release}/fasta/homo_sapiens/dna/\
Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
# Download genome annotation
wget -L ftp://ftp.ensembl.org/pub/release-${latest_release}/gtf/homo_sapiens/\
Homo_sapiens.GRCh38.${latest_release}.gtf.gz💡 Tip: Always use the latest release of reference files for the most up-to-date genome annotations.
Create a SLURM job script (run_nextflow.sh) to execute
the pipeline:
#!/bin/bash
#SBATCH --job-name=Bulk_RNAseq # Job name
#SBATCH --nodes=1 # Run on a single node
#SBATCH --ntasks=32 # Number of tasks (cores)
#SBATCH --mem=256gb # Memory requirement
#SBATCH --time=24:00:00 # Time limit (24 hours)
#SBATCH --output=./SLURM_OUT/%x_%A.out # Output file
#SBATCH --error=./SLURM_OUT/%x_%A.err # Error file
# Print job info
echo "Starting job at: $(date)"
echo "Running on node: $(hostname)"
echo "Working directory: $(pwd)"
echo "Using ${SLURM_CPUS_ON_NODE} CPU cores"
# Run nf-core/rnaseq pipeline
nextflow run nf-core/rnaseq -r 3.16.0 \
-profile apptainer \
--input rawData/sample_sheet.csv \
--outdir results \
--fasta references/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
--gtf references/Homo_sapiens.GRCh38.99.gtf.gz \
--aligner star_salmonPipeline Parameters: - -r 3.16.0:
Pipeline version - -profile apptainer: Container profile -
--input: Sample sheet location - --outdir:
Output directory - --fasta: Reference genome sequence -
--gtf: Reference genome annotation -
--aligner: STAR with Salmon quantification
Submit the job to SLURM:
executor > local (288)
[4d/e88d76] NFC…na.primary_assembly.fa.gz) | 1 of 1 ✔
[f3/04cb10] NFC…_sapiens.GRCh38.99.gtf.gz) | 1 of 1 ✔
[97/0a1dac] NFC…8.dna.primary_assembly.fa) | 1 of 1 ✔
[ec/1f96a8] NFC…ary_assembly.filtered.gtf) | 1 of 1 ✔
[c8/4075f2] NFC…8.dna.primary_assembly.fa) | 1 of 1 ✔
[b1/2eebb8] NFC…8.dna.primary_assembly.fa) | 1 of 1 ✔
[6b/7aca1c] NFC…8.dna.primary_assembly.fa) | 1 of 1 ✔
[- ] NFC…_SETSTRANDEDNESS:CAT_FASTQ -
[37/51a998] NFC…E:FASTQC (PC3_Normoxia_S2) | 8 of 8 ✔
[89/2db109] NFC…RIMGALORE (PC3_Hypoxia_S2) | 8 of 8 ✔
[2a/6f0f4a] NFC…EX (genome.transcripts.fa) | 1 of 1 ✔
[1b/69a515] NFC…SAMPLE (LNCAP_Normoxia_S2) | 8 of 8 ✔
[de/1db031] NFC…_QUANT (LNCAP_Normoxia_S1) | 8 of 8 ✔
[50/531a76] NFC…_ALIGN (LNCAP_Normoxia_S1) | 8 of 8 ✔
[46/33ae9a] NFC…S_SORT (LNCAP_Normoxia_S1) | 8 of 8 ✔
[c7/5e2724] NFC…_INDEX (LNCAP_Normoxia_S1) | 8 of 8 ✔
[4c/d2378a] NFC…_STATS (LNCAP_Normoxia_S1) | 8 of 8 ✔
[cc/75a2cd] NFC…FLAGSTAT (PC3_Normoxia_S1) | 8 of 8 ✔
[fa/09e140] NFC…DXSTATS (LNCAP_Hypoxia_S2) | 8 of 8 ✔
[60/948088] NFC…_QUANT (LNCAP_Normoxia_S1) | 8 of 8 ✔
[26/d9c087] NFC…LMON:CUSTOM_TX2GENE (null) | 1 of 1 ✔
[3d/2c8362] NFC…AR_SALMON:TXIMETA_TXIMPORT | 1 of 1 ✔
[0a/e83622] NFC…LMON:SE_GENE (all_samples) | 1 of 1 ✔
[cf/5dbd84] NFC…ENGTH_SCALED (all_samples) | 1 of 1 ✔
[64/67bd55] NFC…_GENE_SCALED (all_samples) | 1 of 1 ✔
[18/566451] NFC…E_TRANSCRIPT (all_samples) | 1 of 1 ✔
[fe/0cfd95] NFC…ASEQ:DESEQ2_QC_STAR_SALMON | 1 of 1 ✔
[28/ca3860] NFC…ICATES (LNCAP_Normoxia_S1) | 8 of 8 ✔
[60/c7e69b] NFC…_INDEX (LNCAP_Normoxia_S1) | 8 of 8 ✔
[82/03ce63] NFC…OLS_STATS (PC3_Hypoxia_S1) | 8 of 8 ✔
[ee/c48141] NFC…FLAGSTAT (PC3_Normoxia_S1) | 8 of 8 ✔
[18/d3254b] NFC…_IDXSTATS (PC3_Hypoxia_S1) | 8 of 8 ✔
[2b/777e77] NFC…INGTIE (LNCAP_Normoxia_S1) | 8 of 8 ✔
[35/91ff22] NFC…COUNTS (LNCAP_Normoxia_S1) | 8 of 8 ✔
[f7/4a9a08] NFC…IOTYPE (LNCAP_Normoxia_S1) | 8 of 8 ✔
[d8/563ed2] NFC…COV_FW (LNCAP_Normoxia_S1) | 8 of 8 ✔
[f1/0c38f3] NFC…OV_REV (LNCAP_Normoxia_S1) | 8 of 8 ✔
[f4/ca548d] NFC…EDCLIP (LNCAP_Normoxia_S1) | 8 of 8 ✔
[a6/a43b1a] NFC…BIGWIG (LNCAP_Normoxia_S1) | 8 of 8 ✔
[81/0c3b16] NFC…EDCLIP (LNCAP_Normoxia_S1) | 8 of 8 ✔
[cd/2fa162] NFC…BIGWIG (LNCAP_Normoxia_S1) | 8 of 8 ✔
[4a/005a31] NFC…RNASEQ (LNCAP_Normoxia_S1) | 8 of 8 ✔
[48/1ba811] NFC…PRADAR (LNCAP_Normoxia_S1) | 8 of 8 ✔
[db/e887db] NFC…AMSTAT (LNCAP_Normoxia_S1) | 8 of 8 ✔
[dc/943607] NFC…STANCE (LNCAP_Normoxia_S1) | 8 of 8 ✔
[81/a0ba9f] NFC…RIMENT (LNCAP_Normoxia_S1) | 8 of 8 ✔
[45/944d35] NFC…TATION (LNCAP_Normoxia_S1) | 8 of 8 ✔
[c6/32ba39] NFC…RATION (LNCAP_Normoxia_S1) | 8 of 8 ✔
[69/38729b] NFC…BUTION (LNCAP_Normoxia_S1) | 8 of 8 ✔
[6d/91dec1] NFC…CATION (LNCAP_Normoxia_S1) | 8 of 8 ✔
[6c/b8daaf] NFC…_RNASEQ:RNASEQ:MULTIQC (1) | 1 of 1 ✔
-[nf-core/rnaseq] Pipeline completed successfully -
Completed at: 30-May-2025 22:26:13
Duration : 2h 50m 30s
CPU hours : 63.9
Succeeded : 288
After the script has finished running, the output folders containing
the alignment files for each sample should be found in the directory
star_salmon/.
star_salmon/ directory, we will find a separate
folder of results for each of the FASTQ files you mapped. Inside this
folder, you should see the following files:
.
├── bigwig/
│ ├── LNCAP_Hypoxia_S1.forward.bigWig
│ ├── LNCAP_Hypoxia_S1.reverse.bigWig
│ ├── LNCAP_Hypoxia_S2.forward.bigWig
│ ├── LNCAP_Hypoxia_S2.reverse.bigWig
│ ├── LNCAP_Normoxia_S1.forward.bigWig
│ ├── LNCAP_Normoxia_S1.reverse.bigWig
│ ├── LNCAP_Normoxia_S2.forward.bigWig
│ ├── LNCAP_Normoxia_S2.reverse.bigWig
│ ├── PC3_Hypoxia_S1.forward.bigWig
│ ├── PC3_Hypoxia_S1.reverse.bigWig
│ ├── PC3_Hypoxia_S2.forward.bigWig
│ ├── PC3_Hypoxia_S2.reverse.bigWig
│ ├── PC3_Normoxia_S1.forward.bigWig
│ ├── PC3_Normoxia_S1.reverse.bigWig
│ ├── PC3_Normoxia_S2.forward.bigWig
│ └── PC3_Normoxia_S2.reverse.bigWig
├── deseq2_qc/
│ ├── size_factors/
│ ├── deseq2.dds.RData
│ ├── deseq2.pca.vals.txt
│ ├── deseq2.plots.pdf
│ ├── deseq2.sample.dists.txt
│ └── R_sessionInfo.log
├── dupradar/
│ ├── box_plot/
│ ├── gene_data/
│ ├── histogram/
│ ├── intercepts_slope/
│ └── scatter_plot/
├── featurecounts/
│ ├── LNCAP_Hypoxia_S1.biotype_counts_mqc.tsv
│ ├── LNCAP_Hypoxia_S1.biotype_counts_rrna_mqc.tsv
│ ├── LNCAP_Hypoxia_S1.featureCounts.txt
│ ├── LNCAP_Hypoxia_S1.featureCounts.txt.summary
│ ├── LNCAP_Hypoxia_S2.biotype_counts_mqc.tsv
│ ├── LNCAP_Hypoxia_S2.biotype_counts_rrna_mqc.tsv
│ ├── LNCAP_Hypoxia_S2.featureCounts.txt
│ ├── LNCAP_Hypoxia_S2.featureCounts.txt.summary
│ ├── LNCAP_Normoxia_S1.biotype_counts_mqc.tsv
│ ├── LNCAP_Normoxia_S1.biotype_counts_rrna_mqc.tsv
│ ├── LNCAP_Normoxia_S1.featureCounts.txt
│ ├── LNCAP_Normoxia_S1.featureCounts.txt.summary
│ ├── LNCAP_Normoxia_S2.biotype_counts_mqc.tsv
│ ├── LNCAP_Normoxia_S2.biotype_counts_rrna_mqc.tsv
│ ├── LNCAP_Normoxia_S2.featureCounts.txt
│ ├── LNCAP_Normoxia_S2.featureCounts.txt.summary
│ ├── PC3_Hypoxia_S1.biotype_counts_mqc.tsv
│ ├── PC3_Hypoxia_S1.biotype_counts_rrna_mqc.tsv
│ ├── PC3_Hypoxia_S1.featureCounts.txt
│ ├── PC3_Hypoxia_S1.featureCounts.txt.summary
│ ├── PC3_Hypoxia_S2.biotype_counts_mqc.tsv
│ ├── PC3_Hypoxia_S2.biotype_counts_rrna_mqc.tsv
│ ├── PC3_Hypoxia_S2.featureCounts.txt
│ ├── PC3_Hypoxia_S2.featureCounts.txt.summary
│ ├── PC3_Normoxia_S1.biotype_counts_mqc.tsv
│ ├── PC3_Normoxia_S1.biotype_counts_rrna_mqc.tsv
│ ├── PC3_Normoxia_S1.featureCounts.txt
│ ├── PC3_Normoxia_S1.featureCounts.txt.summary
│ ├── PC3_Normoxia_S2.biotype_counts_mqc.tsv
│ ├── PC3_Normoxia_S2.biotype_counts_rrna_mqc.tsv
│ ├── PC3_Normoxia_S2.featureCounts.txt
│ └── PC3_Normoxia_S2.featureCounts.txt.summary
├── LNCAP_Hypoxia_S1/
│ ├── aux_info/
│ ├── libParams/
│ ├── logs/
│ ├── cmd_info.json
│ ├── quant.genes.sf
│ └── quant.sf
├── LNCAP_Hypoxia_S2/
│ ├── aux_info/
│ ├── libParams/
│ ├── logs/
│ ├── cmd_info.json
│ ├── quant.genes.sf
│ └── quant.sf
├── LNCAP_Normoxia_S1/
│ ├── aux_info/
│ ├── libParams/
│ ├── logs/
│ ├── cmd_info.json
│ ├── quant.genes.sf
│ └── quant.sf
├── LNCAP_Normoxia_S2/
│ ├── aux_info/
│ ├── libParams/
│ ├── logs/
│ ├── cmd_info.json
│ ├── quant.genes.sf
│ └── quant.sf
├── log/
│ ├── LNCAP_Hypoxia_S1.Log.final.out
│ ├── LNCAP_Hypoxia_S1.Log.out
│ ├── LNCAP_Hypoxia_S1.Log.progress.out
│ ├── LNCAP_Hypoxia_S1.SJ.out.tab
│ ├── LNCAP_Hypoxia_S2.Log.final.out
│ ├── LNCAP_Hypoxia_S2.Log.out
│ ├── LNCAP_Hypoxia_S2.Log.progress.out
│ ├── LNCAP_Hypoxia_S2.SJ.out.tab
│ ├── LNCAP_Normoxia_S1.Log.final.out
│ ├── LNCAP_Normoxia_S1.Log.out
│ ├── LNCAP_Normoxia_S1.Log.progress.out
│ ├── LNCAP_Normoxia_S1.SJ.out.tab
│ ├── LNCAP_Normoxia_S2.Log.final.out
│ ├── LNCAP_Normoxia_S2.Log.out
│ ├── LNCAP_Normoxia_S2.Log.progress.out
│ ├── LNCAP_Normoxia_S2.SJ.out.tab
│ ├── PC3_Hypoxia_S1.Log.final.out
│ ├── PC3_Hypoxia_S1.Log.out
│ ├── PC3_Hypoxia_S1.Log.progress.out
│ ├── PC3_Hypoxia_S1.SJ.out.tab
│ ├── PC3_Hypoxia_S2.Log.final.out
│ ├── PC3_Hypoxia_S2.Log.out
│ ├── PC3_Hypoxia_S2.Log.progress.out
│ ├── PC3_Hypoxia_S2.SJ.out.tab
│ ├── PC3_Normoxia_S1.Log.final.out
│ ├── PC3_Normoxia_S1.Log.out
│ ├── PC3_Normoxia_S1.Log.progress.out
│ ├── PC3_Normoxia_S1.SJ.out.tab
│ ├── PC3_Normoxia_S2.Log.final.out
│ ├── PC3_Normoxia_S2.Log.out
│ ├── PC3_Normoxia_S2.Log.progress.out
│ └── PC3_Normoxia_S2.SJ.out.tab
├── PC3_Hypoxia_S1/
│ ├── aux_info/
│ ├── libParams/
│ ├── logs/
│ ├── cmd_info.json
│ ├── quant.genes.sf
│ └── quant.sf
├── PC3_Hypoxia_S2/
│ ├── aux_info/
│ ├── libParams/
│ ├── logs/
│ ├── cmd_info.json
│ ├── quant.genes.sf
│ └── quant.sf
├── PC3_Normoxia_S1/
│ ├── aux_info/
│ ├── libParams/
│ ├── logs/
│ ├── cmd_info.json
│ ├── quant.genes.sf
│ └── quant.sf
├── PC3_Normoxia_S2/
│ ├── aux_info/
│ ├── libParams/
│ ├── logs/
│ ├── cmd_info.json
│ ├── quant.genes.sf
│ └── quant.sf
├── picard_metrics/
│ ├── LNCAP_Hypoxia_S1.markdup.sorted.MarkDuplicates.metrics.txt
│ ├── LNCAP_Hypoxia_S2.markdup.sorted.MarkDuplicates.metrics.txt
│ ├── LNCAP_Normoxia_S1.markdup.sorted.MarkDuplicates.metrics.txt
│ ├── LNCAP_Normoxia_S2.markdup.sorted.MarkDuplicates.metrics.txt
│ ├── PC3_Hypoxia_S1.markdup.sorted.MarkDuplicates.metrics.txt
│ ├── PC3_Hypoxia_S2.markdup.sorted.MarkDuplicates.metrics.txt
│ ├── PC3_Normoxia_S1.markdup.sorted.MarkDuplicates.metrics.txt
│ └── PC3_Normoxia_S2.markdup.sorted.MarkDuplicates.metrics.txt
├── qualimap/
│ ├── LNCAP_Hypoxia_S1/
│ ├── LNCAP_Hypoxia_S2/
│ ├── LNCAP_Normoxia_S1/
│ ├── LNCAP_Normoxia_S2/
│ ├── PC3_Hypoxia_S1/
│ ├── PC3_Hypoxia_S2/
│ ├── PC3_Normoxia_S1/
│ └── PC3_Normoxia_S2/
├── rseqc/
│ ├── bam_stat/
│ ├── infer_experiment/
│ ├── inner_distance/
│ ├── junction_annotation/
│ ├── junction_saturation/
│ ├── read_distribution/
│ └── read_duplication/
├── samtools_stats/
│ ├── LNCAP_Hypoxia_S1.markdup.sorted.bam.flagstat
│ ├── LNCAP_Hypoxia_S1.markdup.sorted.bam.idxstats
│ ├── LNCAP_Hypoxia_S1.markdup.sorted.bam.stats
│ ├── LNCAP_Hypoxia_S1.sorted.bam.flagstat
│ ├── LNCAP_Hypoxia_S1.sorted.bam.idxstats
│ ├── LNCAP_Hypoxia_S1.sorted.bam.stats
│ ├── LNCAP_Hypoxia_S2.markdup.sorted.bam.flagstat
│ ├── LNCAP_Hypoxia_S2.markdup.sorted.bam.idxstats
│ ├── LNCAP_Hypoxia_S2.markdup.sorted.bam.stats
│ ├── LNCAP_Hypoxia_S2.sorted.bam.flagstat
│ ├── LNCAP_Hypoxia_S2.sorted.bam.idxstats
│ ├── LNCAP_Hypoxia_S2.sorted.bam.stats
│ ├── LNCAP_Normoxia_S1.markdup.sorted.bam.flagstat
│ ├── LNCAP_Normoxia_S1.markdup.sorted.bam.idxstats
│ ├── LNCAP_Normoxia_S1.markdup.sorted.bam.stats
│ ├── LNCAP_Normoxia_S1.sorted.bam.flagstat
│ ├── LNCAP_Normoxia_S1.sorted.bam.idxstats
│ ├── LNCAP_Normoxia_S1.sorted.bam.stats
│ ├── LNCAP_Normoxia_S2.markdup.sorted.bam.flagstat
│ ├── LNCAP_Normoxia_S2.markdup.sorted.bam.idxstats
│ ├── LNCAP_Normoxia_S2.markdup.sorted.bam.stats
│ ├── LNCAP_Normoxia_S2.sorted.bam.flagstat
│ ├── LNCAP_Normoxia_S2.sorted.bam.idxstats
│ ├── LNCAP_Normoxia_S2.sorted.bam.stats
│ ├── PC3_Hypoxia_S1.markdup.sorted.bam.flagstat
│ ├── PC3_Hypoxia_S1.markdup.sorted.bam.idxstats
│ ├── PC3_Hypoxia_S1.markdup.sorted.bam.stats
│ ├── PC3_Hypoxia_S1.sorted.bam.flagstat
│ ├── PC3_Hypoxia_S1.sorted.bam.idxstats
│ ├── PC3_Hypoxia_S1.sorted.bam.stats
│ ├── PC3_Hypoxia_S2.markdup.sorted.bam.flagstat
│ ├── PC3_Hypoxia_S2.markdup.sorted.bam.idxstats
│ ├── PC3_Hypoxia_S2.markdup.sorted.bam.stats
│ ├── PC3_Hypoxia_S2.sorted.bam.flagstat
│ ├── PC3_Hypoxia_S2.sorted.bam.idxstats
│ ├── PC3_Hypoxia_S2.sorted.bam.stats
│ ├── PC3_Normoxia_S1.markdup.sorted.bam.flagstat
│ ├── PC3_Normoxia_S1.markdup.sorted.bam.idxstats
│ ├── PC3_Normoxia_S1.markdup.sorted.bam.stats
│ ├── PC3_Normoxia_S1.sorted.bam.flagstat
│ ├── PC3_Normoxia_S1.sorted.bam.idxstats
│ ├── PC3_Normoxia_S1.sorted.bam.stats
│ ├── PC3_Normoxia_S2.markdup.sorted.bam.flagstat
│ ├── PC3_Normoxia_S2.markdup.sorted.bam.idxstats
│ ├── PC3_Normoxia_S2.markdup.sorted.bam.stats
│ ├── PC3_Normoxia_S2.sorted.bam.flagstat
│ ├── PC3_Normoxia_S2.sorted.bam.idxstats
│ └── PC3_Normoxia_S2.sorted.bam.stats
├── stringtie/
│ ├── LNCAP_Hypoxia_S1.ballgown/
│ ├── LNCAP_Hypoxia_S2.ballgown/
│ ├── LNCAP_Normoxia_S1.ballgown/
│ ├── LNCAP_Normoxia_S2.ballgown/
│ ├── PC3_Hypoxia_S1.ballgown/
│ ├── PC3_Hypoxia_S2.ballgown/
│ ├── PC3_Normoxia_S1.ballgown/
│ ├── PC3_Normoxia_S2.ballgown/
│ ├── LNCAP_Hypoxia_S1.coverage.gtf
│ ├── LNCAP_Hypoxia_S1.gene.abundance.txt
│ ├── LNCAP_Hypoxia_S1.transcripts.gtf
│ ├── LNCAP_Hypoxia_S2.coverage.gtf
│ ├── LNCAP_Hypoxia_S2.gene.abundance.txt
│ ├── LNCAP_Hypoxia_S2.transcripts.gtf
│ ├── LNCAP_Normoxia_S1.coverage.gtf
│ ├── LNCAP_Normoxia_S1.gene.abundance.txt
│ ├── LNCAP_Normoxia_S1.transcripts.gtf
│ ├── LNCAP_Normoxia_S2.coverage.gtf
│ ├── LNCAP_Normoxia_S2.gene.abundance.txt
│ ├── LNCAP_Normoxia_S2.transcripts.gtf
│ ├── PC3_Hypoxia_S1.coverage.gtf
│ ├── PC3_Hypoxia_S1.gene.abundance.txt
│ ├── PC3_Hypoxia_S1.transcripts.gtf
│ ├── PC3_Hypoxia_S2.coverage.gtf
│ ├── PC3_Hypoxia_S2.gene.abundance.txt
│ ├── PC3_Hypoxia_S2.transcripts.gtf
│ ├── PC3_Normoxia_S1.coverage.gtf
│ ├── PC3_Normoxia_S1.gene.abundance.txt
│ ├── PC3_Normoxia_S1.transcripts.gtf
│ ├── PC3_Normoxia_S2.coverage.gtf
│ ├── PC3_Normoxia_S2.gene.abundance.txt
│ └── PC3_Normoxia_S2.transcripts.gtf
├── LNCAP_Hypoxia_S1.markdup.sorted.bam
├── LNCAP_Hypoxia_S1.markdup.sorted.bam.bai
├── LNCAP_Hypoxia_S2.markdup.sorted.bam
├── LNCAP_Hypoxia_S2.markdup.sorted.bam.bai
├── LNCAP_Normoxia_S1.markdup.sorted.bam
├── LNCAP_Normoxia_S1.markdup.sorted.bam.bai
├── LNCAP_Normoxia_S2.markdup.sorted.bam
├── LNCAP_Normoxia_S2.markdup.sorted.bam.bai
├── null.merged.gene_counts_length_scaled.SummarizedExperiment.rds
├── null.merged.gene_counts_scaled.SummarizedExperiment.rds
├── null.merged.gene_counts.SummarizedExperiment.rds
├── null.merged.transcript_counts.SummarizedExperiment.rds
├── PC3_Hypoxia_S1.markdup.sorted.bam
├── PC3_Hypoxia_S1.markdup.sorted.bam.bai
├── PC3_Hypoxia_S2.markdup.sorted.bam
├── PC3_Hypoxia_S2.markdup.sorted.bam.bai
├── PC3_Normoxia_S1.markdup.sorted.bam
├── PC3_Normoxia_S1.markdup.sorted.bam.bai
├── PC3_Normoxia_S2.markdup.sorted.bam
├── PC3_Normoxia_S2.markdup.sorted.bam.bai
├── salmon.merged.gene_counts_length_scaled.tsv
├── salmon.merged.gene_counts_scaled.tsv
├── salmon.merged.gene_counts.tsv
├── salmon.merged.gene_lengths.tsv
├── salmon.merged.gene_tpm.tsv
├── salmon.merged.transcript_counts.tsv
├── salmon.merged.transcript_lengths.tsv
├── salmon.merged.transcript_tpm.tsv
└── tx2gene.tsv
DESeq2 is a robust method for differential expression analysis(Love, Huber, and Anders 2014) that:
🎯 Analysis Objectives: 1. 🔍 Identify genes differentially expressed between: - 🌡️ Hypoxic vs normoxic conditions - 🧫 LNCaP and PC3 cell lines 2. 🔄 Account for batch effects and experimental variables 3. 📊 Generate visualizations to validate results
Let’s start by examining our count data from the RNA-seq pipeline:
# Read and process count data
data <- read.table(
"data/salmon.merged.gene_counts_length_scaled.tsv",
header = TRUE,
row.names = 1
)
# Sort columns and round to integers
data <- data[, sort(colnames(data))]
data_round <- round(data[2:9], digits = 0)
# Preview the data
head(data_round)Data Structure: - 60,676 rows (genes) - 8 columns (samples) - Values represent raw read counts per gene per sample
Let’s check the total number of reads (sequencing depth) for each sample:
# Calculate total reads per sample
total_reads <- colSums(data_round[1:8])
# Create a data frame for display
reads_df <- data.frame(
Sample = names(total_reads),
Total_Reads = total_reads,
Total_Reads_Millions = round(total_reads / 1e6, 2),
row.names = NULL
)
reads_df Sequencing Depth Analysis:
Sample depth varies considerably: - Highest: LNCAP_Normoxia_S2 (~57M reads) - Lowest: PC3_Normoxia_S1 (~17M reads)
Key Observations: 1. Over 3-fold difference between highest and lowest depth 2. PC3_Hypoxia_S2 has notably low depth (~17M reads) 3. Most samples have 40-55M reads
Implications: - Direct comparison of raw counts not possible - Normalization required to account for depth differences - DESeq2 will handle this through size factor estimation
Before proceeding with differential expression analysis, we need to account for:
💡 Note: DESeq2 handles normalization through: - Size factor estimation - Variance stabilizing transformation - Dispersion estimation
A DESeq2 analysis requires three key components:
Required Components:
countData
colData
design Formula
First, let’s define our experimental conditions:
Next, create a data frame with sample information:
# Create sample metadata data frame
my_colData <- data.frame(
rownames = colnames(data_round)[1:8], # Sample names
condition = factor(condition) # Experimental conditions
)
# Reorder columns and display
my_colData <- my_colData[, c("rownames", "condition")]
my_colDataWith our count data and sample metadata prepared, we can create the DESeq2 object:
# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(
countData = data_round, # Count matrix
colData = my_colData, # Sample metadata
design = ~ condition # Experimental design
)
# Display object information
dds#> class: DESeqDataSet
#> dim: 60676 8
#> metadata(1): version
#> assays(1): counts
#> rownames(60676): ENSG00000000003 ENSG00000000005 ... ENSG00000288587 ENSG00000288588
#> rowData names(0):
#> colnames(8): LNCAP_Hypoxia_S1 LNCAP_Hypoxia_S2 ... PC3_Normoxia_S1 PC3_Normoxia_S2
#> colData names(2): rownames condition
The DESeq() function performs the core differential
expression analysis through several steps:
Analysis Steps:
Let’s run the analysis:
#> class: DESeqDataSet
#> dim: 60676 8
#> metadata(1): version
#> assays(4): counts mu H cooks
#> rownames(60676): ENSG00000000003 ENSG00000000005 ... ENSG00000288587 ENSG00000288588
#> rowData names(30): baseMean baseVar ... deviance maxCooks
#> colnames(8): LNCAP_Hypoxia_S1 LNCAP_Hypoxia_S2 ... PC3_Normoxia_S1 PC3_Normoxia_S2
#> colData names(3): rownames condition sizeFactor
The DESeq2 object contains various data components that we can inspect:
Available Data: - Raw counts - Normalized counts - Size factors - Dispersions - Model coefficients - Test statistics
Let’s examine the raw count matrix:
| LNCAP_Hypoxia_S1 | LNCAP_Hypoxia_S2 | LNCAP_Normoxia_S1 | LNCAP_Normoxia_S2 | PC3_Hypoxia_S1 | PC3_Hypoxia_S2 | PC3_Normoxia_S1 | PC3_Normoxia_S2 | |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | 694 | 652 | 366 | 332 | 1087 | 384 | 336 | 865 |
| ENSG00000000005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000000419 | 4317 | 4761 | 4969 | 5939 | 3127 | 995 | 962 | 2374 |
| ENSG00000000457 | 725 | 789 | 566 | 648 | 99 | 30 | 30 | 94 |
| ENSG00000000460 | 351 | 393 | 419 | 407 | 454 | 206 | 258 | 601 |
| ENSG00000000938 | 2 | 1 | 2 | 2 | 0 | 0 | 0 | 2 |
Key Observations: 1. Counts vary widely between genes 2. Some genes show consistent expression patterns 3. Many genes have zero counts in some samples 4. Expression patterns differ between cell lines
DESeq2 provides size-factor normalized counts that account for sequencing depth differences:
| LNCAP_Hypoxia_S1 | LNCAP_Hypoxia_S2 | LNCAP_Normoxia_S1 | LNCAP_Normoxia_S2 | PC3_Hypoxia_S1 | PC3_Hypoxia_S2 | PC3_Normoxia_S1 | PC3_Normoxia_S2 | |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | 385.26 | 313.23 | 192.03 | 144.41 | 1382 | 1302 | 990.4 | 1017.99 |
| ENSG00000000005 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 | 0.0 | 0.00 |
| ENSG00000000419 | 2396.52 | 2287.27 | 2607.16 | 2583.34 | 3975 | 3373 | 2835.6 | 2793.88 |
| ENSG00000000457 | 402.47 | 379.05 | 296.97 | 281.87 | 126 | 102 | 88.4 | 110.63 |
| ENSG00000000460 | 194.85 | 188.81 | 219.84 | 177.04 | 577 | 698 | 760.5 | 707.30 |
| ENSG00000000938 | 1.11 | 0.48 | 1.05 | 0.87 | 0 | 0 | 0.0 | 2.35 |
Accessing Count Data:
| LNCAP_Hypoxia_S1 | LNCAP_Hypoxia_S2 | LNCAP_Normoxia_S1 | LNCAP_Normoxia_S2 | PC3_Hypoxia_S1 | PC3_Hypoxia_S2 | PC3_Normoxia_S1 | PC3_Normoxia_S2 | |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | 694 | 652 | 366 | 332 | 1087 | 384 | 336 | 865 |
| ENSG00000000005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000000419 | 4317 | 4761 | 4969 | 5939 | 3127 | 995 | 962 | 2374 |
| ENSG00000000457 | 725 | 789 | 566 | 648 | 99 | 30 | 30 | 94 |
| ENSG00000000460 | 351 | 393 | 419 | 407 | 454 | 206 | 258 | 601 |
| ENSG00000000938 | 2 | 1 | 2 | 2 | 0 | 0 | 0 | 2 |
| LNCAP_Hypoxia_S1 | LNCAP_Hypoxia_S2 | LNCAP_Normoxia_S1 | LNCAP_Normoxia_S2 | PC3_Hypoxia_S1 | PC3_Hypoxia_S2 | PC3_Normoxia_S1 | PC3_Normoxia_S2 | |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | 385.26 | 313.23 | 192.03 | 144.41 | 1382 | 1302 | 990.4 | 1017.99 |
| ENSG00000000005 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 | 0.0 | 0.00 |
| ENSG00000000419 | 2396.52 | 2287.27 | 2607.16 | 2583.34 | 3975 | 3373 | 2835.6 | 2793.88 |
| ENSG00000000457 | 402.47 | 379.05 | 296.97 | 281.87 | 126 | 102 | 88.4 | 110.63 |
| ENSG00000000460 | 194.85 | 188.81 | 219.84 | 177.04 | 577 | 698 | 760.5 | 707.30 |
| ENSG00000000938 | 1.11 | 0.48 | 1.05 | 0.87 | 0 | 0 | 0.0 | 2.35 |
#> LNCAP_Hypoxia_S1 LNCAP_Hypoxia_S2 LNCAP_Normoxia_S1 LNCAP_Normoxia_S2 PC3_Hypoxia_S1 PC3_Hypoxia_S2
#> 1.801 2.082 1.906 2.299 0.787 0.295
#> PC3_Normoxia_S1 PC3_Normoxia_S2
#> 0.339 0.850
Effects of Normalization:
These normalized counts provide a foundation for: - Comparing expression between samples - Visualizing gene expression patterns - Identifying differentially expressed genes
To make our analysis more interpretable, we’ll convert Ensembl IDs to common gene names using BioMart(Durinck et al. 2009):
BioMart Data Export Steps:
Import and examine the annotation data:
# Read annotation file
annotation <- read.csv(
"reference_genomes/GRCh38.p14_annotation.csv",
header = TRUE,
stringsAsFactors = FALSE
)
# Preview annotations
head(annotation)Annotation Information: - Gene stable ID: Unique Ensembl identifier - Gene name: Common gene symbol - Gene type: Biological classification
We’ll merge the normalized counts with gene annotations using tidyverse functions:
# Convert normalized counts to data frame with Ensembl IDs
normalized_counts <- rownames_to_column(
as.data.frame(normalized_counts),
var = "ensembl_id"
)
# Join annotations with expression data
annotated_data <- right_join(
annotation,
normalized_counts,
by = c("Gene.stable.ID" = "ensembl_id")
)
# Preview merged data
head(annotated_data)
# Save annotated results
write.csv(
annotated_data,
file = "results/gene_annotated_normalized_counts.csv",
row.names = FALSE
)Features of Merged Dataset:
The annotated expression data provides several advantages: - Easy gene lookup by symbol - Biological context through gene types - Normalized values for direct comparisons - Export-ready format for collaboration
Next, we’ll explore visualizations of this data to understand expression patterns and identify differentially expressed genes.
Before proceeding with differential expression analysis, it’s crucial to assess sample-to-sample variability to: - Identify potential outliers - Validate experimental design - Confirm replicate similarity - Detect batch effects
Key Questions to Address: 1. How similar are biological replicates? 2. Do samples cluster by experimental condition? 3. Are there any obvious batch effects? 4. Which genes show high variability?
We’ll use variance stabilizing transformation (VST) to prepare data for visualization:
# Apply variance stabilizing transformation
vsd <- vst(dds, blind = TRUE)
# Confirm transformation
print(sprintf("Number of genes transformed: %d", nrow(vsd)))#> [1] "Number of genes transformed: 60676"
💡 Why VST? - Reduces mean-dependent variance - Makes data more suitable for visualization - Enables comparison across wide expression ranges - Preserves relative differences between samples
We’ll create a heatmap of sample-to-sample distances to visualize relationships between samples:
Method Overview: 1. Calculate Euclidean distances between samples 2. Create distance matrix 3. Perform hierarchical clustering 4. Visualize as heatmap
# Function to plot sample distances
plotDists <- function(vsd.obj) {
# Calculate sample distances
sampleDists <- dist(t(assay(vsd.obj)))
sampleDistMatrix <- as.matrix(sampleDists)
# Add condition labels
rownames(sampleDistMatrix) <- paste(vsd.obj$condition)
# Create color palette
colors <- colorRampPalette(
rev(RColorBrewer::brewer.pal(9, "Blues"))
)(255)
# Generate heatmap
pheatmap::pheatmap(
sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors,
main = "Sample Distance Matrix",
fontsize = 15,
border_color = NA
)
}
# Generate and save plots
p1 <- plotDists(vsd)
p1Figure 3: Hierarchical clustering of sample-to-sample distances
Key Observations:
The clustering patterns validate our experimental design and suggest high-quality data suitable for differential expression analysis.
Examining the most variable genes helps identify key drivers of sample differences and potential markers:
Analysis Strategy: 1. Calculate gene-wise variance 2. Select top variable genes 3. Center expression values 4. Create annotated heatmap 5. Identify expression patterns
# Function to create variable gene heatmap
variable_gene_heatmap <- function(vsd.obj, num_genes = 500, annotation, title = "") {
# Set up color scheme
brewer_palette <- "RdBu"
ramp <- colorRampPalette(RColorBrewer::brewer.pal(11, brewer_palette))
mr <- ramp(256)[256:1]
# Process expression data
stabilized_counts <- assay(vsd.obj)
row_variances <- rowVars(stabilized_counts)
# Select top variable genes
top_variable_genes <- stabilized_counts[
order(row_variances, decreasing=TRUE)[1:num_genes],
]
# Center expression values
top_variable_genes <- top_variable_genes -
rowMeans(top_variable_genes, na.rm=TRUE)
# Map to gene symbols
gene_names <- annotation$Gene.name[
match(rownames(top_variable_genes), annotation$Gene.stable.ID)
]
rownames(top_variable_genes) <- gene_names
# Prepare sample annotations
coldata <- as.data.frame(vsd.obj@colData)
coldata$sizeFactor <- NULL
# Generate heatmap
pheatmap::pheatmap(
top_variable_genes,
color = mr,
annotation_col = coldata,
fontsize_col = 8,
fontsize_row = 250/num_genes,
border_color = NA,
main = title,
show_rownames = TRUE,
cluster_cols = TRUE,
cluster_rows = TRUE,
clustering_method = "complete",
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean"
)
}
# Generate and save heatmap
p2 <- variable_gene_heatmap(
vsd,
num_genes = 40,
annotation = annotation,
title = "Top 40 Most Variable Genes"
)
p2Figure 4: Heatmap of top 40 most variable genes
Key Patterns:
Biological Implications:
The distinct clustering patterns confirm the quality of our experimental design and data, providing a solid foundation for differential expression analysis.
Principal Component Analysis (PCA) helps visualize the major sources of variation in our dataset:
PCA Benefits: - Reduces data dimensionality - Identifies major variance sources - Reveals sample relationships - Highlights potential batch effects
# Function to create PCA plot
plot_PCA <- function(vsd.obj) {
# Generate PCA data
pcaData <- plotPCA(vsd.obj,
intgroup = c("condition"),
returnData = TRUE)
# Calculate variance percentages
percentVar <- round(100 * attr(pcaData, "percentVar"))
# Create enhanced PCA plot
ggplot(pcaData, aes(PC1, PC2, color = condition)) +
# Add points with larger size
geom_point(size = 8, alpha = 0.8) +
# Add labels with repulsion
ggrepel::geom_text_repel(
aes(label = name),
color = "black",
size = 3,
box.padding = 0.5,
point.padding = 0.5
) +
# Customize theme
theme_bw() +
theme(
legend.position = "right",
panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold"),
legend.title = element_text(size = 11, face = "bold"),
legend.text = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
) +
# Add labels and title
labs(
x = paste0("PC1: ", percentVar[1], "% variance"),
y = paste0("PC2: ", percentVar[2], "% variance"),
title = "Principal Component Analysis",
color = "Condition"
)
}
# Generate and save PCA plot
p3 <- plot_PCA(vsd)
p3Figure 5: PCA plot
🔍 Key Observations:
📋 Recommended Analysis Strategy:
Given the strong cell type effect, we should: 1. Separate LNCaP and PC3 samples 2. Create individual DESeq2 objects per cell line 3. Analyze hypoxia effects independently 4. Compare responses between cell lines
This approach will help isolate treatment effects from the overwhelming cell type differences.
Key Points: - PCA reveals cell type as dominant source of variation - Separate analyses needed for unbiased hypoxia response assessment - Cell-specific analysis enables identification of: - Common hypoxia response genes - Cell type-specific responses - Potential therapeutic targets
Based on our exploratory analysis, we’ll employ a systematic approach:
1. Data Separation - Create cell line-specific datasets - Preserve biological replicates - Maintain data structure
2. Independent Analysis - Generate cell-specific DESeq2 objects - Perform normalized analyses - Control for technical factors
3. Response Comparison - Identify shared responses - Highlight cell-specific patterns - Quantify effect sizes
4. Biological Integration - Pathway analysis by cell type - Mechanistic insights - Therapeutic implications
We’ll create a robust function for generating cell-specific DESeq2 objects:
#' Create Cell-Specific DESeq2 Object
#'
#' This function creates a DESeq2 object for analyzing differential expression
#' within a specific cell line. It handles data subsetting, quality control,
#' and DESeq2 initialization.
#'
#' @param my_data Count matrix with genes as rows and samples as columns
#' @param groups Character vector of length 2 specifying conditions to compare
#' @return A DESeq2 object ready for differential expression analysis
generate_DESeq_object <- function(my_data, groups) {
# Input validation
if (length(groups) != 2) {
stop("Exactly two groups must be provided for comparison")
}
# Extract and validate data for each group
data_subset1 <- my_data[, grep(str_c("^", groups[1]), colnames(my_data))]
data_subset2 <- my_data[, grep(str_c("^", groups[2]), colnames(my_data))]
if (ncol(data_subset1) == 0 || ncol(data_subset2) == 0) {
stop("No samples found for one or both groups")
}
# Combine data
my_countData <- cbind(data_subset1, data_subset2)
# Create condition vector with informative names
condition <- c(
rep(groups[1], ncol(data_subset1)),
rep(groups[2], ncol(data_subset2))
)
# Create comprehensive sample metadata
my_colData <- data.frame(
condition = factor(condition),
cell_line = factor(str_extract(condition, "^[^_]+")),
treatment = factor(str_extract(condition, "[^_]+$")),
row.names = colnames(my_countData)
)
# Create DESeq2 object with quality filters
dds <- DESeqDataSetFromMatrix(
countData = my_countData,
colData = my_colData,
design = ~ condition
)
# Filter low-count genes
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
message(sprintf("\nRetained %d genes after filtering", nrow(dds)))
# Run DESeq2 analysis
message("\nRunning DESeq2 analysis...")
dds <- DESeq(dds, quiet = TRUE)
# Report completion
message("Analysis complete!")
return(dds)
}Let’s create cell-specific DESeq2 objects and validate their structure:
# Create LNCaP-specific object
lncap <- generate_DESeq_object(
data_round,
c("LNCAP_Hypoxia", "LNCAP_Normoxia")
)
# Create PC3-specific object
pc3 <- generate_DESeq_object(
data_round,
c("PC3_Hypoxia", "PC3_Normoxia")
)
# Perform variance stabilizing transformation
lncap_vsd <- vst(lncap, blind = TRUE)
pc3_vsd <- vst(pc3, blind = TRUE)Quality Control Checks: 1. Sample metadata correctly assigned 2. Low-count genes filtered out 3. Normalization factors calculated 4. Design formula properly specified
Now we’ll examine hypoxia response patterns within each cell line:
We’ll analyze the most variable genes in each cell line to understand their distinct responses to hypoxia:
# Generate LNCaP variable genes heatmap
p4 <- variable_gene_heatmap(
lncap_vsd,
num_genes = 30,
annotation = annotation,
title = "LNCaP Hypoxia Response Genes"
)We observe that the LNCaP variable genes are enriched in genes that are specific to the LNCaP cell line, such as KLK3 (PSA) and KLK2. The PC3 variable genes are enriched in genes that are specific to the PC3 cell line, such as TMPRSS2 and SLC45A3.
# Generate PC3 variable genes heatmap
p5 <- variable_gene_heatmap(pc3_vsd, 30, annotation = annotation, title = "PC3 variable genes")
p6 <- gridExtra::grid.arrange(p4[[4]],p5[[4]], nrow = 1)
p6Figure 6: LNCaP_PC3_variable_genes plot
This is a vastly different set of genes than what was observed when we had samples from both cell lines combined. Our differential gene expression analysis should now yield a greater depth of genes for the comparison we care about: hypoxia vs normoxia.
Extract differentially expressed genes between hypoxia and normoxia
conditions using DESeq2’s results() function:
#> log2 fold change (MLE): condition LNCAP_Hypoxia vs LNCAP_Normoxia
#> Wald test p-value: condition LNCAP_Hypoxia vs LNCAP_Normoxia
#> DataFrame with 23422 rows and 6 columns
#> baseMean log2FoldChange lfcSE stat pvalue padj
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> ENSG00000000003 518.727 1.0693831 0.256250 4.173203 0.0000300347 0.000328262
#> ENSG00000000419 4939.277 -0.1333728 0.120121 -1.110321 0.2668606680 0.470253565
#> ENSG00000000457 681.180 0.4480815 0.182940 2.449339 0.0143118852 0.056207398
#> ENSG00000000460 390.541 -0.0330573 0.243049 -0.136011 0.8918125879 0.949226832
#> ENSG00000001036 398.773 -0.6328722 0.298934 -2.117099 0.0342514141 0.108988224
#> ... ... ... ... ... ... ...
#> ENSG00000288573 11.07937 -0.6620047 1.225872 -0.540027 0.5891781 0.765965
#> ENSG00000288582 3.29568 -1.0474228 2.025291 -0.517172 0.6050364 NA
#> ENSG00000288585 70.09038 -0.9664820 0.517927 -1.866057 0.0620334 0.169749
#> ENSG00000288586 77.58432 -0.6772347 0.455622 -1.486398 0.1371740 0.300559
#> ENSG00000288587 16.53520 -0.0978059 0.910850 -0.107379 0.9144885 0.960657
Key statistics from DESeq2 results: - baseMean: Overall
gene expression level - log2FoldChange: Expression
difference between conditions (negative = higher in normoxia) -
padj: Adjusted p-value for significance
Filtering criteria: 1. Statistical significance: padj < cutoff 2. Expression change: Sort by log2FoldChange 3. Optional: Filter low-expressed genes using baseMean or CPM 4. Generate ranked list (.rnk) for enrichment analysis
Since there are many variations of sorting / filtering the data, I
wrote the function generate_DE_results() to perform
multiple filtering steps and provide output in the form of csv files. I
export these files so they can be easily shared and interpreted by other
bench scientists without having to read the data into R.
generate_DE_results <- function (dds, comparisons, padjcutoff = 0.001, log2cutoff = 0.5, cpmcutoff = 2) {
# generate average counts per million metric from raw count data
raw_counts <- counts(dds, normalized = F)
cpms <- enframe(rowMeans(edgeR::cpm(raw_counts)))
colnames(cpms) <- c("ensembl_id", "avg_cpm")
# extract DESeq results between the comparisons indicated
res <- results(dds, contrast = c("condition", comparisons[1], comparisons[2]))[,-c(3,4)]
# annotate the data with gene name and average counts per million value
res <- as_tibble(res, rownames = "ensembl_id")
# read in the annotation and append it to the data
my_annotation <- read.csv("reference_genomes/GRCh38.p14_annotation.csv", header = TRUE, stringsAsFactors = F)
res <- left_join(res, my_annotation, by = c("ensembl_id" = "Gene.stable.ID"))
# append the average cpm value to the results data
res <- left_join(res, cpms, by = c("ensembl_id" = "ensembl_id"))
# combine normalized counts with entire DE list
normalized_counts <- round(counts(dds, normalized = TRUE),3)
pattern <- str_c(comparisons[1], "|", comparisons[2])
combined_data <- as_tibble(cbind(res, normalized_counts[,grep(pattern, colnames(normalized_counts))] ))
combined_data <- combined_data[order(combined_data$log2FoldChange, decreasing = TRUE),]
# make ordered rank file for GSEA, selecting only protein coding genes
res_prot <- res[which(res$Gene.type == "protein_coding"),]
res_prot_ranked <- res_prot[order(res_prot$log2FoldChange, decreasing = TRUE),c("Gene.name", "log2FoldChange")]
res_prot_ranked <- na.omit(res_prot_ranked)
res_prot_ranked$Gene.name <- str_to_upper(res_prot_ranked$Gene.name)
# generate sorted lists with the indicated cutoff values
res <- res[order(res$log2FoldChange, decreasing=TRUE ),]
de_genes_padj <- res[which(res$padj < padjcutoff),]
de_genes_log2f <- res[which(abs(res$log2FoldChange) > log2cutoff & res$padj < padjcutoff),]
de_genes_cpm <- res[which(res$avg_cpm > cpmcutoff & res$padj < padjcutoff),]
# write output to files
write.csv (de_genes_padj, file = paste0("results/", comparisons[1], "_vs_", comparisons[2], "_padj_cutoff.csv"), row.names = FALSE)
write.csv (de_genes_log2f, file = paste0("results/", comparisons[1], "_vs_", comparisons[2], "_log2f_cutoff.csv"), row.names = FALSE)
write.csv (de_genes_cpm, file = paste0("results/", comparisons[1], "_vs_", comparisons[2], "_cpm_cutoff.csv"), row.names = FALSE)
write.csv (combined_data, file = paste0("results/", comparisons[1], "_vs_", comparisons[2], "_allgenes.csv"), row.names = FALSE)
write.table (res_prot_ranked, file = paste0("results/", comparisons[1], "_vs_", comparisons[2], "_rank.rnk"), sep = "\t", row.names = F, quote = F)
writeLines( paste0("For the comparison: ", comparisons[1], "_vs_", comparisons[2], ", out of ", nrow(combined_data), " genes, there were: \n",
nrow(de_genes_padj), " genes below padj ", padjcutoff, "\n",
nrow(de_genes_log2f), " genes below padj ", padjcutoff, " and above a log2FoldChange of ", log2cutoff, "\n",
nrow(de_genes_cpm), " genes below padj ", padjcutoff, " and above an avg cpm of ", cpmcutoff, "\n",
"Gene lists ordered by log2fchange with the cutoffs above have been generated.") )
gene_count <- tibble (cutoff_parameter = c("padj", "log2fc", "avg_cpm" ),
cutoff_value = c(padjcutoff, log2cutoff, cpmcutoff),
signif_genes = c(nrow(de_genes_padj), nrow(de_genes_log2f), nrow(de_genes_cpm)))
invisible(gene_count)
}After running the function, the console should display the number of genes that passed each filtering critera:
lncap_output <- generate_DE_results (lncap, c("LNCAP_Hypoxia", "LNCAP_Normoxia"))
pc3_output <- generate_DE_results(pc3, c("PC3_Hypoxia", "PC3_Normoxia"))#> For the comparison: LNCAP_Hypoxia_vs_LNCAP_Normoxia, out of 23422 genes, there were:
#> 1984 genes below padj 0.001
#> 1963 genes below padj 0.001 and above a log2FoldChange of 0.5
#> 1875 genes below padj 0.001 and above an avg cpm of 2
#> Gene lists ordered by log2fchange with the cutoffs above have been generated.
#> For the comparison: PC3_Hypoxia_vs_PC3_Normoxia, out of 20576 genes, there were:
#> 765 genes below padj 0.001
#> 765 genes below padj 0.001 and above a log2FoldChange of 0.5
#> 653 genes below padj 0.001 and above an avg cpm of 2
#> Gene lists ordered by log2fchange with the cutoffs above have been generated.
We should also see a set of csv files show up in the same folder as
the R script. We can read in the output files with
read.csv() and observe what the file structure is like:
The _allgenes.csv file will be used in the next section
for visualizations. It is an aggregate of all the data we have curated
so far: the results() output, the gene annotations, and the
normalized counts for every gene. We can further organize/filter the
data for the visualizations based on user preferences. The other csv
files, such as _padj_cutoff.csv, can be distributed to
researchers who are only interested in the gene lists.
Visualize differential expression results to examine: 1. Expression patterns of specific genes 2. Sample variability and top DEGs 3. Distribution of up/down regulated genes 4. Shared DEGs between comparisons 5. Pathway enrichment analysis
We are often interested in plotting the normalized counts for certain
genes of interest, in order to get a sense of the spread of the data and
how genes are expressed across each of the groups. This is especially of
interest for genes that show up as differentially expressed in our
DESeq2 results. A way to plot the normalized counts for each sample is
by using the built-in DESeq2::plotCounts() function.
However, this function leaves a lot to be desired. As an
example, I plot the gene IGFBP1 (a hypoxia-inducible gene) using its
Ensembl ID below.
As you can see, the sample names aren’t even all displayed if their
names are too long. We can instead use ggplot2 to graph the data, by
extracting the normalized counts using
returnData = TRUE:
d <- plotCounts(dds, gene="ENSG00000146678", intgroup="condition", returnData=TRUE)
p7 <- ggplot(d, aes(x=condition, y=count)) +
geom_point(size = 10, position=position_jitter(w=0.4,h=0), aes (color = condition)) + scale_y_log10(breaks=c(25,100,400))
p7Figure 7: ENSG00000146678 plot
However, this is still very frustrating to work with, since we have
to input the Ensembl ID instead of a gene name. Researchers typically
want to search by gene name, and this function forces us to look up the
Ensembl ID each time. To solve this problem, I wrote my own function
plot_counts() to accept either the gene name, Ensembl ID,
or an index such as which.min(res$padj). I also provide two
ways to normalize the data: by counts per million (cpm) or using the
built-in DESeq2 normalized counts.
plot_counts <- function (dds, gene, normalization = "DESeq2"){
# read in the annotation file
annotation <- read.csv("reference_genomes/GRCh38.p14_annotation.csv", header = TRUE, stringsAsFactors = F)
# obtain normalized data
if (normalization == "cpm") {
normalized_data <- cpm(counts(dds, normalized = F)) # normalize the raw data by counts per million
} else if (normalization == "DESeq2")
normalized_data <- counts(dds, normalized = TRUE) # use DESeq2 normalized counts
# get sample groups from colData
condition <- dds@colData$condition
# get the gene name from the ensembl id
if (is.numeric(gene)) { # check if an index is supplied or if ensembl_id is supplied
if (gene%%1==0 )
ensembl_id <- rownames(normalized_data)[gene]
else
stop("Invalid index supplied.")
} else if (gene %in% annotation$Gene.name){ # check if a gene name is supplied
ensembl_id <- annotation$Gene.stable.ID[which(annotation$Gene.name == gene)]
} else if (gene %in% annotation$Gene.stable.ID){
ensembl_id <- gene
} else {
stop("Gene not found. Check spelling.")
}
expression <- normalized_data[ensembl_id,]
gene_name <- annotation$Gene.name[which(annotation$Gene.stable.ID == ensembl_id)]
# construct a tibble with the grouping and expression
gene_tib <- tibble(condition = condition, expression = expression)
ggplot(gene_tib, aes(x = condition, y = expression))+
geom_boxplot(aes(fill = condition), outlier.size = NULL)+
geom_point(aes(color = condition))+
labs (title = paste0("Expression of ", gene_name, " - ", ensembl_id), x = "group", y = paste0("Normalized expression (", normalization , ")"))+
theme(axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 11))
}
p8 <- plot_counts(dds, "IGFBP1")
p8Figure 8: plot_counts
We observe that IGFBP1 is elevated in the the hypoxia conditions relative to the respective normoxia controls. Interestingly, the upregulation of IGFBP1 is much higher in PC3 cells than in LNCaP cells. This is a nice example of how this plot is useful to determine relative expression levels between samples.
The most common heatmap used to display RNA-seq results is the
differential gene heatmap. This is a heatmap where the normalized values
for each sample are plotted for the top most upregulated genes which are
significantly different between the comparison groups. Starting from the
results object, we first filter the results by padj and
then take the genes which have the greatest log2FoldChange
value. The normalized counts for each gene is scaled by row (across
samples) in order to emphasize the difference between the comparison
groups (hypoxia vs normoxia). The purpose of this visualization is to
give a sense of how variable the data are from sample to sample, as well
as to quickly examine the top differential genes. I wrote the function
DE_gene_heatmap() to generate this plot, using
pheatmap::pheatmap() with the scale = "row"
argument.
res <- read.csv ("results/LNCAP_Hypoxia_vs_LNCAP_Normoxia_allgenes.csv", header = TRUE)
DE_gene_heatmap <- function(res, padj_cutoff = 0.0001, ngenes = 20) {
# generate the color palette
brewer_palette <- "RdBu"
ramp <- colorRampPalette(RColorBrewer::brewer.pal(11, brewer_palette))
mr <- ramp(256)[256:1]
# obtain the significant genes and order by log2FoldChange
significant_genes <- res %>% filter(padj < padj_cutoff) %>% arrange (desc(log2FoldChange)) %>% head (ngenes)
heatmap_values <- as.matrix(significant_genes[,-c(1:8)])
rownames(heatmap_values) <- significant_genes$Gene.name
# plot the heatmap using pheatmap
pheatmap::pheatmap(heatmap_values, color = mr, scale = "row", fontsize_col = 12, fontsize_row = 200/ngenes, fontsize = 5, border_color = NA)
}
p9 <- DE_gene_heatmap(res, 0.001, 30)
p9Figure 9: DE_gene_heatmap plot
We observe that expression of the differential genes is very consistent within groups. We are also able to identify that LNCAP_Hypoxia_S1 has higher expression of the top ~10 genes than LNCAP_Hypoxia_S2.
The volcano plot is a powerful visualization that combines statistical significance with magnitude of change, allowing us to identify genes with both large fold changes and high statistical confidence. In this plot:
log2FoldChange shows the magnitude and
direction of expression changes-log10(padj) represents statistical
significanceInterpretation Guide: 1. Top right quadrant: Significantly upregulated genes (high fold change, low p-value) 2. Top left quadrant: Significantly downregulated genes (negative fold change, low p-value) 3. Center/bottom: Genes with minimal changes or insufficient statistical confidence
Let’s create an enhanced volcano plot for the LNCaP hypoxia vs normoxia comparison:
# Read the results file
res <- read.csv("results/LNCAP_Hypoxia_vs_LNCAP_Normoxia_allgenes.csv", header = TRUE)
plot_volcano <- function(res, padj_cutoff = 0.05, fc_cutoff = 1, nlabel = 10, label.by = "padj") {
# Create significance categories
res <- res %>%
mutate(
expression = case_when(
log2FoldChange >= fc_cutoff & padj < padj_cutoff ~ "Upregulated",
log2FoldChange <= -fc_cutoff & padj < padj_cutoff ~ "Downregulated",
TRUE ~ "Not Significant"
)
) %>%
filter(!is.na(padj)) # Remove NA values
# Prepare genes for labeling
if (label.by == "padj") {
up_genes <- res %>%
filter(expression == "Upregulated") %>%
arrange(padj) %>%
head(nlabel)
down_genes <- res %>%
filter(expression == "Downregulated") %>%
arrange(padj) %>%
head(nlabel)
} else if (label.by == "log2FoldChange") {
up_genes <- res %>%
filter(expression == "Upregulated") %>%
arrange(desc(abs(log2FoldChange))) %>%
head(nlabel)
down_genes <- res %>%
filter(expression == "Downregulated") %>%
arrange(desc(abs(log2FoldChange))) %>%
head(nlabel)
}
# Generate statistics for title
stats <- list(
total = nrow(res),
up = sum(res$expression == "Upregulated"),
down = sum(res$expression == "Downregulated")
)
# Create enhanced volcano plot
ggplot(res, aes(x = log2FoldChange, y = -log10(padj))) +
# Add points
geom_point(aes(color = expression), alpha = 0.7, size = 1) +
# Color scheme
scale_color_manual(
values = c(
"Upregulated" = "#fc8d59",
"Downregulated" = "#91bfdb",
"Not Significant" = "grey80"
)
) +
# Add gene labels with repulsion
ggrepel::geom_text_repel(
data = up_genes,
aes(label = Gene.name),
size = 3,
color = "#d73027",
max.overlaps = Inf,
box.padding = 0.5
) +
ggrepel::geom_text_repel(
data = down_genes,
aes(label = Gene.name),
size = 3,
color = "#4575b4",
max.overlaps = Inf,
box.padding = 0.5
) +
# Add threshold lines
geom_vline(xintercept = c(-fc_cutoff, fc_cutoff), linetype = "dashed", color = "grey50", alpha = 0.5) +
geom_hline(yintercept = -log10(padj_cutoff), linetype = "dashed", color = "grey50", alpha = 0.5) +
# Customize theme
theme_minimal() +
theme(
legend.position = "right",
panel.grid.minor = element_blank(),
plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 10)
) +
# Add labels
labs(
title = "Differential Expression Analysis",
subtitle = sprintf("Total: %d genes | Up: %d | Down: %d (padj < %.2g, |log2FC| > %.1f)",
stats$total, stats$up, stats$down, padj_cutoff, fc_cutoff),
x = "log2 Fold Change",
y = "-log10(adjusted p-value)",
color = "Expression Change"
)
}
# Generate volcano plots with different thresholds and labeling strategies
# 1. Default stringent analysis
p10 <- plot_volcano(res,
padj_cutoff = 0.001, # Stringent significance threshold
fc_cutoff = 1.5, # Require at least 2.8-fold change
nlabel = 10, # Label top 10 genes
label.by = "padj") # Label most significant genes
p10Figure 10: Volcano plot_1(padj_cutoff = 0.001)
# 2. Alternative analysis focusing on fold change
p11 <- plot_volcano(res,
padj_cutoff = 0.05, # More permissive significance
fc_cutoff = 2, # Focus on genes with larger changes
nlabel = 15, # Label more genes
label.by = "log2FoldChange") # Label genes with largest changes
p11Figure 11: Volcano plot_2(padj_cutoff = 0.05)
# Create a column for gene names from the rownames
res$gene_names <- res$Gene.name
# Sort the results by padj and then by log2FoldChange
# This is a good way to find the most significant and most-changed genes
sorted_res <- res[order(res$padj, abs(res$log2FoldChange), decreasing = c(FALSE, TRUE)), ]
# Filter for significantly upregulated genes
top_up_genes <- subset(sorted_res, log2FoldChange > 2 & padj < 0.05)
# Filter for significantly downregulated genes
top_down_genes <- subset(sorted_res, log2FoldChange < -2 & padj < 0.05)
# Select the top 10 gene names from each list
select_lab_list <- c(head(top_up_genes, 10)$gene_names, head(top_down_genes, 10)$gene_names)
# Create the enhanced volcano plot
p13 <- EnhancedVolcano(res,
lab = res$gene_name, # Use the column with gene names
x = 'log2FoldChange',
y = 'padj',
ylab = bquote(~-Log[10] ~ italic(FDR)),
pCutoff = 0.05,
FCcutoff = 2,
selectLab = select_lab_list, # Add this line
legendLabels=c('Not DE &\nabsolute FC < 2',
'Not DE &\nabsolute FC > 2',
'FDR < 0.05 &\nabsolute FC < 2',
'FDR < 0.05 &\nabsolute FC > 2'
),
legendPosition = 'right')
p13Figure 12: Volcanoplot-enhanced
The downregulated genes are labeled in a different color compared to the upregulated genes. We observe that there are a lot more significantly upregulated genes than there are downregulated genes. Given that we are looking for gene expression changes in cells grown in hypoxic conditions compared to normoxic conditions, this suggests that cells may be ramping up transcription of response pathways.
Key Findings from Volcano Analysis:
# Save results with detailed categorization
res_processed <- res %>%
mutate(
expression_category = case_when(
log2FoldChange >= 2 & padj <= 0.001 ~ "Strongly Upregulated",
log2FoldChange >= 1 & padj <= 0.05 ~ "Upregulated",
log2FoldChange <= -2 & padj <= 0.001 ~ "Strongly Downregulated",
log2FoldChange <= -1 & padj <= 0.05 ~ "Downregulated",
TRUE ~ "Not Significant"
)
)
# Save results for downstream analysis
saveRDS(res_processed, "results/processed_differential_expression.rds")
write.csv(res_processed, "results/processed_differential_expression.csv", row.names = FALSE)Given that we have performed differential gene expression analysis on two different cell lines, we would like to compare the two lines in terms of how their gene expression responds to hypoxia. We want to answer questions such as:
To do so, we can take the significant genes from the hypoxia vs
normoxia comparison for both lines and generate a scatterplot of their
log2FoldChange values using the custom function below,
compare_significant_genes(). This function will display the
overlap in the gene lists by color coding significant genes that are
shared by both cell lines, as well as those that are unique to each
individual cell line.
res1 <- read.csv ("results/LNCAP_Hypoxia_vs_LNCAP_Normoxia_allgenes.csv", header = TRUE)
res2 <- read.csv ("results/PC3_Hypoxia_vs_PC3_Normoxia_allgenes.csv", header = TRUE)
compare_significant_genes <- function (res1, res2, padj_cutoff=0.0001, ngenes=250, nlabel=10, samplenames=c("comparison1", "comparison2"), title = "" ) {
# get list of most upregulated or downregulated genes for each results table
genes1 <- rbind(head(res1[which(res1$padj < padj_cutoff),], ngenes), tail(res1[which(res1$padj < padj_cutoff),], ngenes))
genes2 <- rbind(head(res2[which(res2$padj < padj_cutoff),], ngenes), tail(res2[which(res2$padj < padj_cutoff),], ngenes))
# combine the data from both tables
de_union <- union(genes1$ensembl_id,genes2$ensembl_id)
res1_union <- res1[match(de_union, res1$ensembl_id),][c("ensembl_id", "log2FoldChange", "Gene.name")]
res2_union <- res2[match(de_union, res2$ensembl_id),][c("ensembl_id", "log2FoldChange", "Gene.name")]
combined <- left_join(res1_union, res2_union, by = "ensembl_id", suffix = samplenames )
# identify overlap between genes in both tables
combined$de_condition <- 1 # makes a placeholder column
combined$de_condition[which(combined$ensembl_id %in% intersect(genes1$ensembl_id,genes2$ensembl_id))] <- "Significant in Both"
combined$de_condition[which(combined$ensembl_id %in% setdiff(genes1$ensembl_id,genes2$ensembl_id))] <- paste0("Significant in ", samplenames[1])
combined$de_condition[which(combined$ensembl_id %in% setdiff(genes2$ensembl_id,genes1$ensembl_id))] <- paste0("Significant in ", samplenames[2])
combined[is.na(combined)] <- 0
# find the top most genes within each condition to label on the graph
label1 <- rbind(head(combined[which(combined$de_condition==paste0("Significant in ", samplenames[1])),],nlabel),
tail(combined[which(combined$de_condition==paste0("Significant in ", samplenames[1])),],nlabel))
label2 <- rbind(head(combined[which(combined$de_condition==paste0("Significant in ", samplenames[2])),],nlabel),
tail(combined[which(combined$de_condition==paste0("Significant in ", samplenames[2])),],nlabel))
label3 <- rbind(head(combined[which(combined$de_condition=="Significant in Both"),],nlabel),
tail(combined[which(combined$de_condition=="Significant in Both"),],nlabel))
combined_labels <- rbind(label1,label2,label3)
# plot the genes based on log2FoldChange, color coded by significance
ggplot(combined, aes_string(x = paste0("log2FoldChange", samplenames[1]), y = paste0("log2FoldChange", samplenames[2]) )) +
geom_point(aes(color = de_condition), size = 0.7)+
scale_color_manual(values= c("#00BA38", "#619CFF", "#F8766D", "red"))+
ggrepel::geom_text_repel(data= combined_labels, aes_string(label=paste0("Gene.name", samplenames[1]), color = "de_condition"), show.legend = F, size=3)+
geom_vline(xintercept = c(0,0), size = 0.3, linetype = 2)+
geom_hline(yintercept = c(0,0), size = 0.3, linetype = 2)+
labs(title = title,x = paste0("log2FoldChange in ", samplenames[1]), y = paste0("log2FoldChange in ", samplenames[2]))+
theme_minimal()+
theme(legend.title = element_blank())
}
p14 <- compare_significant_genes(res1,res2, samplenames = c("LNCaP", "PC3"), title = "Hypoxia-induced gene expression differences in LNCaP vs PC3 cells")
p14Figure 14: compare_significant_genes plot
Only a small subset of genes are differentially expressed in both LNCaP and PC3 (green labels), mostly along the x = y diagonal (e.g., CA9, IGFBP5, STC1, PPFIA4), indicating a shared hypoxia response. Genes significant in only one line cluster on the axes, suggesting cell line–specific effects.
To understand the biological significance of the differential expression patterns, we’ll perform a comprehensive functional enrichment analysis using multiple approaches:
Multi-level Analysis Approach:
GSEA provides a powerful way to identify coordinated gene expression changes at the pathway level(Mootha et al. 2003). We’ll analyze both cell lines to understand shared and unique pathway responses to hypoxia.
We need to download the pathway files from the GSEA MSigDB webpage
at: https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp
and load them in using fgsea::gmtPathways(). I will be
working with the HALLMARK pathway set.
library(fgsea)
# read in file containing lists of genes for each pathway
hallmark_pathway <- gmtPathways("reference_genomes/h.all.v2024.1.Hs.symbols.gmt.txt")
head(names(hallmark_pathway))#> [1] "HALLMARK_ADIPOGENESIS" "HALLMARK_ALLOGRAFT_REJECTION" "HALLMARK_ANDROGEN_RESPONSE"
#> [4] "HALLMARK_ANGIOGENESIS" "HALLMARK_APICAL_JUNCTION" "HALLMARK_APICAL_SURFACE"
🔍 The pathway file is loaded as a list of gene sets. We can access
the genes for each pathway by selecting the pathway name using
$:
#> [1] "ACKR3" "ADM" "ADORA2B" "AK4" "AKAP12" "ALDOA" "ALDOB" "ALDOC" "AMPD3" "ANGPTL4"
#> [11] "ANKZF1" "ANXA2" "ATF3" "ATP7A" "B3GALT6" "B4GALNT2" "BCAN" "BCL2" "BGN" "BHLHE40"
After loading the pathways, we have to turn our ranked list into a
vector, in which the log2FoldChange value is named with the
gene name. We also need to get rid of any NA values or duplicate gene
entries. The custom function prepare_ranked_list() will
perform these operations.
# load the ranked list
lncap_ranked_list <- read.table("results/LNCAP_Hypoxia_vs_LNCAP_Normoxia_rank_2.rnk", header = TRUE, stringsAsFactors = F)
head(lncap_ranked_list)# formats the ranked list for the fgsea() function
prepare_ranked_list <- function(ranked_list) {
# if duplicate gene names present, average the values
if( sum(duplicated(ranked_list$Gene.name)) > 0) {
ranked_list <- aggregate(.~Gene.name, FUN = mean, data = ranked_list)
ranked_list <- ranked_list[order(ranked_list$log2FoldChange, decreasing = TRUE),]
}
# omit rows with NA values
ranked_list <- na.omit(ranked_list)
# turn the dataframe into a named vector
ranked_list <- tibble::deframe(ranked_list)
ranked_list
}
lncap_ranked_list <- prepare_ranked_list(lncap_ranked_list)
head(lncap_ranked_list)#> SMIM10L3 PCDH19 NXF2 RYR3 CYP4F22 TNFRSF6B
#> 10.93 10.66 8.74 8.51 7.80 7.69
Now that we have the named vector, we can plug it into the
fgsea() function along with the hallmark pathways object.
This will generate a table of results containing the enrichment scores
associated with each pathway.
# generate GSEA results table using fgsea() by inputting the pathway list and ranked list
fgsea_results <- fgsea(pathways = hallmark_pathway,
stats = lncap_ranked_list,
minSize = 15,
maxSize = 500,
nperm= 1000)
fgsea_results %>% arrange (desc(NES)) %>% dplyr::select (pathway, padj, NES) %>% head()The normalized enrichment scores (NES) tell us how much more enriched the pathway is in the hypoxia samples compared to the normoxia samples. As expected, we observe that the most enriched pathway in response to hypoxia is the HALLMARK_HYPOXIA pathway.
We can visualize the statistics for each pathway using a “waterfall” plot, which is a sideways bar plot of normalized enrichment scores for each of the pathways, color-coded by significance. This plot is great for quickly identifying the significantly enriched pathways.
waterfall_plot <- function (fsgea_results, graph_title) {
fgsea_results %>%
mutate(short_name = str_split_fixed(pathway, "_",2)[,2])%>% # removes 'HALLMARK_' from the pathway title
ggplot( aes(reorder(short_name,NES), NES)) +
geom_bar(stat= "identity", aes(fill = padj<0.05))+
coord_flip()+
labs(x = "Hallmark Pathway", y = "Normalized Enrichment Score", title = graph_title)+
theme(axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 1))
}
p15 <- waterfall_plot(fgsea_results, "Hallmark pathways altered by hypoxia in LNCaP cells")
p15Figure 15: Waterfall plot
🔬 Besides hypoxia, androgen response and MTORC1 signaling are also enriched under hypoxic conditions. Glycolysis is upregulated, while oxidative phosphorylation is downregulated, suggesting a metabolic shift similar to the Warburg effect. This may indicate hypoxia drives the switch from oxidative phosphorylation to glycolysis in tumors.
Interferon response pathways are negatively enriched in hypoxia, implying reduced interferon responsiveness, which is unexpected given the in vitro setting.
We can highlight specific pathways by plotting enrichment curves with
fgsea::plotEnrichment(). Black ticks mark pathway genes;
the green curve shows enrichment for hypoxia (left) or normoxia (right).
Example curves are shown below.
# wrapper for fgsea::plotEnrichment()
plot_enrichment <- function (geneset, pathway, ranked_list) {
plotEnrichment(geneset[[pathway]], ranked_list)+labs (title = pathway)
}
# example of positively enriched pathway (up in Hypoxia)
p16 <- plot_enrichment(hallmark_pathway, "HALLMARK_GLYCOLYSIS" , lncap_ranked_list)
p16Figure 16: Enrichment plot
# example of negatively enriched pathway (down in Hypoxia)
p17 <- plot_enrichment(hallmark_pathway, "HALLMARK_OXIDATIVE_PHOSPHORYLATION" , lncap_ranked_list)
p17Figure 17: Enrichment plot
Gene Set Enrichment Analysis (GSEA) is a sophisticated computational method that evaluates whether pre-defined groups of genes (such as those associated with specific GO terms or KEGG pathways) exhibit statistically significant, coordinated expression changes between different biological conditions. Unlike traditional single-gene analyses, GSEA considers the entire ranked list of genes to identify biological pathways and processes that are systematically altered in the experimental condition.
This section demonstrates the implementation of GSEA using the clusterProfiler package(gencore.bio.nyu.edu 2021), a versatile R package that provides comprehensive tools for statistical analysis and visualization of functional profiles for genes and gene clusters. For detailed documentation and advanced usage, please refer to: https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html
# reading in data from deseq2
df = read.csv("results/res_2.csv")
# df = read.csv("data/drosphila_example_de.csv", header=TRUE)
# we want the log2 fold change
original_gene_list <- df$log2FoldChange
# name the vector
names(original_gene_list) <- df$ensembl_id
# omit any NA values
gene_list<-na.omit(original_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)
head(gene_list)#> ENSG00000207005 ENSG00000286075 ENSG00000165194 ENSG00000269405 ENSG00000100101 ENSG00000198838
#> 23.44 10.93 10.66 8.74 8.67 8.51
Parameters: - keyType: Source of gene
IDs (e.g., “ENTREZID”, “ENSEMBL”, “SYMBOL”). Check available options
with keytypes(org.Dm.eg.db). - ont: Ontology;
one of “BP”, “MF”, “CC”, or “ALL”. - nPerm: Number of
permutations (higher = more accurate, slower). -
minGSSize/maxGSSize: Minimum/maximum gene set
size to include. - pvalueCutoff: P-value threshold. -
pAdjustMethod: Adjustment method (e.g., “BH”,
“bonferroni”).
# library(org.Dm.eg.db)
gse <- clusterProfiler::gseGO(geneList=gene_list,
ont ="ALL",
keyType = "ENSEMBL",
nPerm = 10000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
verbose = TRUE,
OrgDb = org.Hs.eg.db,
pAdjustMethod = "none")require(DOSE)
p18 <- enrichplot::dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign) + scale_size(range = c(1, 8)) # Change the dot size range
p18Figure 18: Dotplot
Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules.
### apply 'pairwise_termsim' before making plots
## calculate the similarity matrix for all gene sets, and not the top 200!
## note that this takes some time
gse_2 <- pairwise_termsim(gse, showCategory = dim(gse)[1] )
## Extract the gene sets. These are the 20 sets that are in the dotplot (10 activated, 10 supressed).
geneSets2plot <- p17$data$Description
## make a emapplot showing only the selected gene sets
p19 <- enrichplot::emapplot(gse_2, showCategory=geneSets2plot)
p19Figure 19: Enrichment map
The treeplot() function performs hierarchical clustering of enriched terms. It relies on the pairwise similarities of the enriched terms calculated by the pairwise_termsim() function, which by default using Jaccard’s similarity index (JC). Users can also use semantic similarity values if it is supported (e.g., GO, DO and MeSH).
Figure 20: Tree plot
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
# categorySize can be either 'pvalue' or 'geneNum'
p21 <- cnetplot(gse, categorySize="pvalue", foldChange=gene_list, showCategory = 3)
p21Figure 21: Category netplot
The upsetplot is an alternative to cnetplot for visualizing the complex association between genes and gene sets. It emphasizes the gene overlapping among different gene sets.
Figure 22: UpSet plot
Grouped by gene set, density plots are generated by using the frequency of fold change values per gene within each set. Helpful to interpret up/down-regulated pathways.
Figure 23: Ridge plot
Shows the running enrichment score for a gene set (green line), with the maximum score (red line) and gene set members (black lines) along the ranked gene list. The ranked metric (log2 fold change) reflects gene–phenotype correlation.
Parameter: - Gene Set Integer: Index of
the gene set in the gse object (1 = first gene set).
# Use the `Gene Set` param for the index in the title, and as the value for geneSetId
p24 <- gseaplot(gse, by = "all", title = gse$Description[1], geneSetID = 1)
p24Figure 24: GSEA plot
To run KEGG pathway enrichment with gseKEGG(), gene IDs
must be converted using the bitr function from
clusterProfiler. Some messages or warnings are normal.
Set fromType in bitr to match the
annotation source (same as keyType in gseGO).
Set toType to one of: ‘kegg’, ‘ncbi-geneid’,
‘ncib-proteinid’, or ‘uniprot’. For org.Dm.eg.db, use
‘ENTREZID’ (equivalent to ‘ncbi-geneid’).
Use original_gene_list as the input for conversion.
# Convert gene IDs for gseKEGG function
# We will lose some genes here because not all IDs will be converted
ids<-clusterProfiler::bitr(names(original_gene_list), fromType = "ENSEMBL", toType = "ENTREZID", OrgDb=org.Hs.eg.db)
# remove duplicate IDS (here I use "ENSEMBL", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids$ENSEMBL),]
# Create a new dataframe df2 which has only the genes which were successfully mapped using the bitr function above
df2 = df[df$ensembl_id %in% dedup_ids$ENSEMBL,]
# Create a new column in df2 with the corresponding ENTREZ IDs
df2$Y = dedup_ids$ENTREZID
# Create a vector of the gene unuiverse
kegg_gene_list <- df2$log2FoldChange
# Name vector with ENTREZ ids
names(kegg_gene_list) <- df2$Y
# omit any NA values
kegg_gene_list<-na.omit(kegg_gene_list)
# *******************************************
kegg_gene_list <- kegg_gene_list[!duplicated(names(kegg_gene_list))]
# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)organism: 3-letter KEGG code (see: https://www.genome.jp/kegg/catalog/org_list.html).
Define as kegg_organism for reuse.nPerm: Number of permutations (higher = more accurate,
slower).minGSSize/maxGSSize: Minimum/maximum gene
set size to include.pvalueCutoff: P-value threshold.pAdjustMethod: Adjustment method (e.g., “BH”,
“bonferroni”).keyType: One of ‘kegg’, ‘ncbi-geneid’,
‘ncib-proteinid’, or ‘uniprot’.kegg_organism = "hsa"
kk2 <- gseKEGG(geneList = kegg_gene_list,
organism = kegg_organism,
nPerm = 10000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
pAdjustMethod = "none",
keyType = "ncbi-geneid")p26 <- dotplot(kk2, showCategory = 10, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign) + scale_size(range = c(1, 8)) # Change the dot size range
p26Figure 26: Dotplot
Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules.
### apply 'pairwise_termsim' before making plots
## calculate the similarity matrix for all gene sets, and not the top 200!
## note that this takes some time
kk3 <- pairwise_termsim(kk2, showCategory = dim(kk2)[1] )
## Extract the gene sets. These are the 20 sets that are in the dotplot (10 activated, 10 supressed).
geneSets2plot_2 <- p23$data$Description
## make a emapplot showing only the selected gene sets
p27 <- enrichplot::emapplot(kk3, showCategory=geneSets2plot_2)
p27Figure 27: Enrichment map
The cnetplot shows connections between genes and biological categories (e.g., GO terms, KEGG pathways) as a network, highlighting shared genes across categories.
# categorySize can be either 'pvalue' or 'geneNum'
p28 <- cnetplot(kk2, categorySize="pvalue", foldChange=gene_list)
p28Figure 28: Category netplot
Helpful to interpret up/down-regulated pathways.
Figure 29: Ridge plot
Standard visualization of GSEA results.
Parameter: - Gene Set Integer: Index of
the gene set in the gse object (1 = first gene set; default
is 1).
# Use the `Gene Set` param for the index in the title, and as the value for geneSetId
p30 <- gseaplot(kk2, by = "all", title = kk2$Description[1], geneSetID = 1)
p30Figure 30: GSEA plot
Pathview visualizes gene expression on KEGG pathways, outputting PNG (native layout) and PDF (publication-ready) formats.
Key Parameters: - gene.data: Named
vector of gene expression (e.g., kegg_gene_list), names =
ENTREZ IDs, values = log2 fold changes - pathway.id: KEGG
pathway ID (e.g., “hsa04130”), found in gseKEGG results; can be a vector
- species: 3-letter KEGG organism code (e.g., “hsa”), must
match gseKEGG analysis
library(pathview)
# Produce the native KEGG plot (PNG)
dme1 <- pathview(gene.data=kegg_gene_list, pathway.id="hsa04130", species = "hsa")
# Produce a different plot (PDF) (not displayed here)
dme2 <- pathview(gene.data=kegg_gene_list, pathway.id="hsa03060", species = "hsa", kegg.native = F)Figure 31: KEGG pathway diagram for hsa04130
Circos plots are powerful tools for visualizing genomic data in a circular layout, allowing us to represent complex genomic relationships and patterns effectively. In this section, we’ll create a comprehensive Circos visualization that displays:
The visualization will help us identify any potential chromosomal regions that are particularly responsive to hypoxic conditions and understand the genome-wide distribution of our differentially expressed genes.
# Load required libraries
library(biomaRt)
#' Function to get human gene coordinates from Ensembl
get_human_ensembl_coordinates <- function() {
# Connect to Ensembl
message("Connecting to Ensembl...")
ensembl <- useEnsembl(biomart = "genes")
# Find and set human dataset
datasets <- listDatasets(ensembl)
human_datasets <- grep("human", datasets$description, ignore.case = TRUE)
if (length(human_datasets) == 0) {
stop("No human dataset found in Ensembl")
}
dataset <- datasets$dataset[human_datasets[1]] # Use first human dataset
message(paste("Using dataset:", dataset))
# Set dataset and get gene coordinates
ensembl <- useDataset(dataset = dataset, mart = ensembl)
# Define attributes for gene coordinates
attributes <- c("ensembl_gene_id", "chromosome_name",
"start_position", "end_position")
# Get coordinates for all genes
message("Retrieving gene coordinates...")
all.genes <- getBM(attributes = attributes,
values = list(ensembl_gene_id = c()),
mart = ensembl)
# Rename column to match DESeq2 output
colnames(all.genes)[1] <- "ensembl_id"
return(all.genes)
}
# Get gene coordinates
gene_coords <- get_human_ensembl_coordinates()
# Merge coordinates with DESeq2 results
message("Merging coordinates with DESeq2 results...")
merged_data <- merge(gene_coords, res_2, by = "ensembl_id", all.x = TRUE)
# Format chromosome names
merged_data$chromosome_name <- paste0("chr", merged_data$chromosome_name)
# Save complete dataset
output_dir <- "data"
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
write.csv(merged_data,
file.path(output_dir, "deseq.csv"),
row.names = FALSE)
# Create and save subset of interesting genes
if (exists("genes_to_check")) {
merged_data_subset <- merged_data[merged_data$Gene.Name %in% genes_to_check, ]
write.csv(merged_data_subset,
file.path(output_dir, "deseq_subset.csv"),
row.names = FALSE)
message(paste("Saved subset of", nrow(merged_data_subset), "genes"))
}# Load required packages
library(circlize)
library(dplyr)
# Prepare data for Circos plot
circos_data <- merged_data %>%
# Filter out non-standard chromosomes
filter(grepl("^chr[0-9XY]+$", chromosome_name)) %>%
# Sort by chromosome and position
arrange(chromosome_name, start_position) %>%
mutate(log2fc = ifelse(is.na(log2FoldChange), 0, log2FoldChange))
# Initialize Circos plot with chromosomes
circos.clear()
message("Initializing Circos plot...")
# Set up Circos plot parameters
circos.par(
"track.height" = 0.1,
"start.degree" = 90,
gap.degree = 2
)
# Create base chromosome ideogram
circos.initializeWithIdeogram(
species = "hg19",
chromosome.index = paste0("chr", c(1:22, "X", "Y"))
)
# Add track for log2 fold changes
circos.genomicTrack(
circos_data,
panel.fun = function(region, value, ...) {
circos.genomicPoints(
region,
value$log2fc,
pch = 16,
cex = 0.5,
col = ifelse(value$log2fc > 0, "red", "blue")
)
},
track.height = 0.2
)
# Add track for significant genes
message("Adding significant genes track...")
sig_genes <- circos_data %>% filter(padj < 0.05)
if (nrow(sig_genes) > 0) {
# Prepare data for labels, ensuring correct column order
sig_genes_for_labels <- sig_genes %>%
dplyr::select(chromosome_name, start_position, end_position, Gene.name, log2fc)
circos.genomicLabels(
sig_genes_for_labels,
labels.column = 4, # Gene.name column
side = "inside",
col = ifelse(sig_genes_for_labels$log2fc > 0, "red", "blue")
)
}
# Save plot
pdf("figures/circos_plot.pdf", width = 10, height = 10)
dev.off()Figure 32: Circos plot
1. Cell Line-Specific Responses - LNCaP cells show strong androgen-dependent responses - PC3 cells exhibit enhanced EMT and invasiveness signatures - Distinct metabolic adaptations in each cell line
2. Shared Hypoxia Response - Common upregulation of glycolysis pathways - Coordinated angiogenic response - HIF-1α target gene activation
3. Metabolic Reprogramming - Shift from oxidative phosphorylation to glycolysis - Cell-specific lipid metabolism adaptations - Differential nutrient utilization strategies
1. Cancer Progression - Hypoxia drives metabolic adaptation - Enhanced survival mechanisms - Potential metastatic capabilities
2. Therapeutic Relevance - Cell-type specific vulnerabilities - Common therapeutic targets - Resistance mechanisms
3. Clinical Applications - Biomarker identification - Treatment stratification - Drug response prediction
1. Additional Investigations - Validation of key findings - Protein-level confirmation - Metabolomic profiling
2. Therapeutic Development - Target validation studies - Drug combination strategies - Resistance mechanism studies
3. Clinical Translation - Biomarker development - Patient stratification - Treatment optimization
#> R version 4.5.0 (2025-04-11 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C LC_TIME=English_United States.utf8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] grid stats4 stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] viridis_0.6.5 viridisLite_0.4.2 showtext_0.9-7 showtextdb_3.0
#> [5] sysfonts_0.8.9 pdftools_3.5.0 DOSE_4.2.0 Rtsne_0.17
#> [9] ggupset_0.4.1 EnhancedVolcano_1.26.0 cowplot_1.2.0 scales_1.4.0
#> [13] janitor_2.2.1 here_1.0.1 biomaRt_2.64.0 org.Hs.eg.db_3.21.0
#> [17] AnnotationDbi_1.70.0 pathview_1.48.0 enrichplot_1.28.4 clusterProfiler_4.16.0
#> [21] fgsea_1.34.2 gridExtra_2.3 RColorBrewer_1.1-3 ggrepel_0.9.6
#> [25] ComplexHeatmap_2.24.1 pheatmap_1.0.13 Matrix_1.7-3 data.table_1.17.8
#> [29] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
#> [33] purrr_1.1.0 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
#> [37] ggplot2_3.5.2 tidyverse_2.0.0 edgeR_4.6.3 limma_3.64.3
#> [41] DESeq2_1.48.2 SummarizedExperiment_1.38.1 Biobase_2.68.0 MatrixGenerics_1.20.0
#> [45] matrixStats_1.5.0 GenomicRanges_1.60.0 GenomeInfoDb_1.44.2 IRanges_2.42.0
#> [49] S4Vectors_0.46.0 BiocGenerics_0.54.0 generics_0.1.4 htmltools_0.5.8.1
#> [53] kableExtra_1.4.0 knitr_1.50
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.5.0 klippy_0.0.0.9500 bitops_1.0-9 ggplotify_0.1.2
#> [5] filelock_1.0.3 R.oo_1.27.1 graph_1.86.0 XML_3.99-0.19
#> [9] lifecycle_1.0.4 httr2_1.2.1 doParallel_1.0.17 rprojroot_2.1.1
#> [13] vroom_1.6.5 lattice_0.22-6 magrittr_2.0.3 sass_0.4.10
#> [17] rmarkdown_2.29 jquerylib_0.1.4 yaml_2.3.10 ggtangle_0.0.7
#> [21] askpass_1.2.1 DBI_1.2.3 abind_1.4-8 R.utils_2.13.0
#> [25] RCurl_1.98-1.17 yulab.utils_0.2.1 rappdirs_0.3.3 circlize_0.4.16
#> [29] GenomeInfoDbData_1.2.14 tidytree_0.4.6 svglite_2.2.1 codetools_0.2-20
#> [33] DelayedArray_0.34.1 xml2_1.4.0 tidyselect_1.2.1 shape_1.4.6.1
#> [37] aplot_0.2.8 UCSC.utils_1.4.0 farver_2.1.2 BiocFileCache_2.16.2
#> [41] jsonlite_2.0.0 GetoptLong_1.0.5 iterators_1.0.14 systemfonts_1.2.3
#> [45] foreach_1.5.2 tools_4.5.0 progress_1.2.3 treeio_1.32.0
#> [49] snow_0.4-4 Rcpp_1.1.0 glue_1.8.0 SparseArray_1.8.1
#> [53] xfun_0.53 qvalue_2.40.0 withr_3.0.2 fastmap_1.2.0
#> [57] digest_0.6.37 timechange_0.3.0 R6_2.6.1 gridGraphics_0.5-1
#> [61] qpdf_1.4.1 textshaping_1.0.3 colorspace_2.1-1 GO.db_3.21.0
#> [65] RSQLite_2.4.3 R.methodsS3_1.8.2 prettyunits_1.2.0 httr_1.4.7
#> [69] S4Arrays_1.8.1 pkgconfig_2.0.3 gtable_0.3.6 blob_1.2.4
#> [73] XVector_0.48.0 clue_0.3-66 png_0.1-8 snakecase_0.11.1
#> [77] ggfun_0.2.0 rstudioapi_0.17.1 tzdb_0.5.0 reshape2_1.4.4
#> [81] rjson_0.2.23 nlme_3.1-168 curl_7.0.0 cachem_1.1.0
#> [85] GlobalOptions_0.1.2 parallel_4.5.0 pillar_1.11.0 vctrs_0.6.5
#> [89] dbplyr_2.5.0 cluster_2.1.8.1 Rgraphviz_2.52.0 evaluate_1.0.5
#> [93] KEGGgraph_1.68.0 cli_3.6.5 locfit_1.5-9.12 compiler_4.5.0
#> [97] rlang_1.1.6 crayon_1.5.3 plyr_1.8.9 fs_1.6.6
#> [101] stringi_1.8.7 BiocParallel_1.42.1 assertthat_0.2.1 Biostrings_2.76.0
#> [105] lazyeval_0.2.2 GOSemSim_2.34.0 hms_1.1.3 patchwork_1.3.2
#> [109] bit64_4.6.0-1 KEGGREST_1.48.1 statmod_1.5.0 igraph_2.1.4
#> [113] memoise_2.0.1 bslib_0.9.0 ggtree_3.16.3 fastmatch_1.1-6
#> [117] bit_4.6.0 ape_5.8-1 gson_0.1.0